Abstract


Prior research suggests that subscores from a single achievement test seldom add 
value over a single total score. Such scores typically correspond to subcontent areas in 
the total content domain, but content subdomains might not provide a sound basis for 
subscores. Using scores on an inferential reading comprehension test from 625 third, 
fourth, and fifth graders, two new methods of creating subscores were explored. 
Three subscores were based on the types of incorrect answers given by students. The 
fourth was based on temporal efficiency in giving correct answers. All four scores 
were reliable. The three subscores based on incorrect answers added value and valid- 
ity. In logistic regression analyses predicting failure to reach proficiency on a statewide 
test, models including subscores fit better than the model with a single total score. 
Including the pattern of incorrect responses improved fit in all three grades, whereas 
including the comprehension efficiency score only modestly improved fit in fourth and 
fifth grades, but not third grade. Area under the curve (AUC) statistics from receiver 
operating characteristic (ROC) curves based on the various models were higher for 
models including subscores than those without subscores. Implications for using mod- 
els with and without subscores are illustrated and discussed. 
Guidelines for educational assessment often recommend reporting diagnostic information
to guide students and teachers in addressing specific needs (e.g., the
Elementary and Secondary Education Act of 2001 and 2015), but how can we 
increase the diagnostic information provided by assessments? One way is by report- 
ing subscores as well as a total score, although researchers have found that subscores 
often provide relatively little added value over a single total score (Haberman, 
2008a; Lyren, 2009; Puhan, Sinharay, Haberman, & Larkin, 2010; Sinharay, 2010). 
For instance, Puhan et al. found that in a mathematics test for beginning teachers 
containing four subcontent areas (concepts, integrate knowledge, models, real-life 
problems), one could estimate a person’s true score in one of the content areas (say 
concepts) more accurately from their total test score than from their subscore in con- 
cepts. In these studies, subscores most often corresponded to the number of items 
correct in subareas of the broader content domain (e.g., real-life problems in the test 
of mathematics). However, this is not the only way to create subscores. For instance, 
subscores can be used to quantify the number of incorrect responses of a given type 
that a student makes or the efficiency with which a student can correctly answer 
items. 

In addition, past evaluations of subscore utility have been based primarily on the 
internal structure of the subscores: factor structure or internal consistency reliability 
(e.g., Puhan et al., 2010; Sinharay, 2010). They typically do not consider the incre- 
mental validity provided by subscores over and above a total score. The premises of 
this research are that, given the previous findings on subscores, the field needs to 
explore other ways of constructing them and needs to explore properties of subscores 
beyond their internal structure. To support this argument, we present results from a
reading comprehension assessment that illustrate two other methods for constructing
subscores.

In this study, we evaluated reading comprehension subscores based, not on con- 
tent subareas such as the distinction between literal and inferential comprehension, 
but on the number of incorrect responses of various types committed by students. 
We also evaluated a score reflecting the temporal efficiency with which students 
arrive at correct answers: minutes per correct response. In addition to analyzing 
the internal structure of the total score and subscores, we examined the validity of 
the overall score and the incremental validity of subscores, over and above the 
overall score, in identifying at-risk students. For this purpose, we employed the 
logistic regression counterpart (Davison, Jew, & Davenport, 2014) of the linear 
regression procedure proposed by Davison and Davenport (2002; Davison, 
Davenport, Chang, Vue, & Su, 2015).


Subscore Types 


The term subscore has at least two different meanings. First, it can refer to a score on 
a test within a test battery (e.g., the SAT Verbal, Quantitative, and Analytic subtests). 
Second, it can refer to subareas from a single test (e.g., literal and inferential 


Biancarosa et al. 67 


comprehension scores from a reading comprehension test). The former typically have 
the advantage that they reflect a wider variety of content areas. The latter have the 
advantage that they extract additional information without requiring students to take 
more than one test. Our reading of the literature on subscore utility leads to the con- 
clusion that subscores from a battery are more likely to have value over and above a 
total score than are subscores from a single test (Haberman, 2008a; Lyren, 2009; 
Puhan et al., 2010; Sinharay, 2010). However, it could be that the limitations of sub- 
scores drawn from a single test were related to how the subscores were defined. For a 
single test, defining subscores in terms of items from subcontent areas may be 
ineffective. 

The subscore types in the present research were drawn from theories of reading 
comprehension processing and observations from narrative comprehension think- 
aloud research (Carlson, Seipel, & McMaster, 2014; Graesser, Singer, & Trabasso, 
1994; Kintsch & van Dijk, 1978; McMaster et al., 2012; Rapp, van den Broek,
McMaster, Kendeou, & Espin, 2007). This research has established that some who 
struggle with reading narrative passages can be considered specific poor compre- 
henders in that they struggle not with decoding (i.e., reading the words off a page), 
but rather with the comprehension process specifically (e.g., Rapp et al., 2007). 
Think-aloud research also indicates that poor comprehenders can be differentiated 
by at least two types of cognitive processes (e.g., paraphrasing and elaboration) on 
which they over-rely, often leading to incorrect responses (Carlson et al., 2014;
McMaster et al., 2012; Rapp et al., 2007). Some students show a strong predilec- 
tion to paraphrasing information from a narrative even when the correct answer 
requires an inference. Others show a predilection for elaborations, evaluations, 
predictions, or associations that go beyond the literal information in the story but 
do not relate to the narrative sequence. Most authors refer to this second process 
as elaboration, because most such responses are elaborations of story information. 
However, because this second category includes more than just elaborations, it is 
here called lateral connection. In narrative comprehension, paraphrases and lateral 
connections represent important comprehension processes and can lead to correct 
answers. For instance, if an item is a literal comprehension task, a paraphrase will 
be the correct answer. In the context of inferential comprehension items, however, which
require inferences that complete a narrative sequence or make connections 
between two disparate pieces of text, neither paraphrases nor lateral connections 
complete the sequence of events. 

Unlike most traditional multiple-choice tests that have two types of responses for 
each item, correct and incorrect, the test on which this research is based has three 
types: correct, paraphrase, and lateral connect. The paraphrase and lateral connect 
types were created because they mimic the cognitive processes that appear in think- 
aloud responses, and because of their relationship to documented intervention out- 
comes. In classroom instruction, poor comprehenders responded differentially 
to intervention based on their preferred cognitive processes during reading 
(McMaster et al., 2012; van den Broek et al., 2006). The comprehension skills of
“paraphrasers” improved more in a general questioning condition (e.g., “Make a connection
to what you previously read.”), whereas the comprehension skills of “lateral connectors”
improved more from questioning about causal sequence (e.g., “Why?”). To date,
this instructional finding has been observed in classroom settings, but not in small-group,
individualized instruction (McMaster, Espin, & van den Broek, 2014).

Other tests have included reports of misconceptions or errors committed by students
(delMas, Garfield, Ooms, & Chance, 2007; Herrmann-Abell & DeBoer, 2011;
Hestenes, Wells, & Swackhamer, 1992; Sadler, 1998). They are called distractor-dri- 
ven assessments by Hestenes et al. and concept inventories by Sadler. However, in 
these assessments, any particular error type typically appears in only a few items, and 
hence the frequency with which the error occurs is not a reliable score. For subscores 
to be useful, they must be reliable (American Educational Research Association, 
American Psychological Association, & National Council on Measurement in 
Education, 2014; Sinharay, 2010). In contrast, each incorrect response type on the 
assessment studied here appeared as a response option for every item. Thus, the para- 
phrase and lateral connect subscores were based on a sufficient number of items to 
yield reliable subscores. 

Another of our subscores is based on the efficiency with which students produce 
correct answers. It is conceptually similar to the index of fluency used to measure 
oral reading proficiency (i.e., correct words per minute), but is calculated and 
reported as the inverse of correct words per minute: i.e., minutes per correct response. 
Comprehension efficiency is determined by dividing the total testing time by the 
number of items answered correctly. In the reading literature, automaticity theory 
(LaBerge & Samuels, 1974), efficiency theory (Perfetti, 1985), or dual processing 
theory (Goldhammer et al., 2014) posit that as students become better readers, the 
cognitive reading process becomes more automatic, less effortful, less consciously 
controlled and thereby less time consuming. Consequently, one would expect that as 
reading comprehension improves, students would be able to reach correct responses 
more efficiently. Greater comprehension efficiency is not a goal in and of
itself, but rather it is evidence that the reading comprehension process is becoming 
more automatic. For the assessment used in this study, computerized administration 
makes it feasible to record the total testing time for each student, from which one can 
compute a student’s comprehension efficiency.


Evaluating Validity via Diagnostic Accuracy 


Receiver operating characteristic (ROC) curves have become one of the standards for 
statistically evaluating the diagnostic accuracy of educational assessments for their 
utility as screening measures (Smolkowski & Cummings, 2015, 2016). ROC curve 
analyses test how well a measure predicts performance-level classification on a cri- 
terion measure (Silberglitt & Hintze, 2005). ROC curves visually depict the propor- 
tion of individuals who actually belong to a group who are correctly identified by the 
screening measure as being in that group (i.e., sensitivity) relative to the proportion
of individuals who do not actually belong in that group who are incorrectly designated
as belonging (i.e., 1 − specificity). ROC curves plot 1 − specificity on the x-axis
and sensitivity on the y-axis, resulting in a curve that begins in the lower left corner,
where both proportions are zero, and rises toward the upper right corner, where
both are 1. A curve that rises steeply toward the upper left corner of the plot indicates 
a more accurate predictor measure, because it represents a higher proportion of cor- 
rect classifications relative to incorrect classifications. Measures with poor diagnostic 
accuracy have an ROC curve close to the diagonal line on the plot, which indicates a 
50% probability of correct classification, the probability of assigning a correct classi- 
fication at random (Swets, Dawes, & Monahan, 2000). 

The area under the curve (AUC) statistic serves as an indicator of the overall diag- 
nostic accuracy of an assessment. AUC values range from .50 to 1.0, where higher 
values indicate higher classification accuracy. An AUC of .50 indicates chance accu- 
racy, and an AUC of 1.0 indicates perfect classification. Traditionally, an AUC 
between .50 and .70 indicates low accuracy, an AUC of .70 to .90 is good, and an 
AUC greater than .90 is excellent (Swets, 1988). Alternatively, the AUC value repre- 
sents the proportion of time a screener correctly identifies individuals with a certain 
condition (e.g., a learning disability) versus individuals without in a randomly 
selected pair (e.g., a student with learning disability and a student without learning 
disability). For example, an AUC of .80 means that in a randomly selected pair of 
students, one with a disability and one without, the measure correctly identifies the 
one with a learning disability in 80% of the trials. 
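
The pairwise interpretation of the AUC can be illustrated with a short sketch. The scores below are invented for illustration and are not data from this study; `pairwise_auc` is a hypothetical helper name, and counting ties as one half is an assumed (though common) convention.

```python
from itertools import product


def pairwise_auc(scores_positive, scores_negative):
    """Proportion of (positive, negative) pairs in which the positive case
    has the higher screening score; ties count as one half."""
    wins = 0.0
    for pos, neg in product(scores_positive, scores_negative):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical number-incorrect scores (higher = more at risk).
not_proficient = [25, 18, 30, 12]
proficient = [10, 14, 8, 20, 5]

auc = pairwise_auc(not_proficient, proficient)
print(round(auc, 2))
```

In this toy example, 17 of the 20 possible pairs are ordered correctly, which corresponds to the proportion the text describes.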

In this research, we first examined the added value of mistake subscores over and 
above the total number of mistakes using Haberman’s (2008a) value added analysis. 
Then we used a series of logistic regression and ROC curve analyses to evaluate the 
predictive validity of multiple types of subscores from a reading test in the identifica- 
tion of students at-risk for not reaching proficiency on a statewide test in Grades 3 to 
5. Our primary hypothesis was that the subscores would add value over and above a 
total score and that subscores would improve the prediction of proficiency over and 
above the total score alone, but that individual subscores may not be equally useful at 
all grades. That is, the more items answered incorrectly, the more information one 
has about the types of processes (i.e., paraphrase vs. lateral connect) to which the stu- 
dent is prone. Therefore, we hypothesized that incorrect answer propensity scores 
would be more useful among less proficient students in lower grades (i.e., when stu- 
dents tend to get more items incorrect). In contrast, automaticity theory suggests that 
automaticity emerges later in the reading development process (Goldhammer et al., 
2014; LaBerge & Samuels, 1974; Perfetti, 1985), suggesting that comprehension effi- 
ciency may be a better predictor among more proficient students and in later stages 
of reading development, when individual differences in automaticity are more pro- 
nounced. Thus, we hypothesized that comprehension efficiency may be more predic- 
tive in fifth grade than in third or fourth grade. 
Method
Participants 


Participants were 625 elementary students from 13 schools in two school districts in 
two western states in 2015-2016: 245 in third grade, 210 in fourth grade, and 170 in 
fifth grade. The sample included 337 (53.9%) females, 240 (38.4%) ethnic minority 
students, 233 students eligible for free and reduced lunch (FRL; 47.8% of the 487 
students for whom FRL data were available), 48 (7.7%) English language learners, 
and 45 (7.2%) students who received special education services. Our sample was 
predominantly White (61.6%) and Hispanic (24.5%) with smaller percentages of 
African American (3.4%), American Indian (1.8%), Asian (5.4%), Hawaiian (1.3%), 
and two or more races (1.4%). Whites were overrepresented and African Americans 
were underrepresented but all other groups were represented roughly in proportion to 
their representation in the K-12 population (U.S. Department of Education, National 
Center for Education Statistics, 2015). 


Assessments 

Smarter Balanced (SBAC) English Language Arts (ELA) Assessment. Developed by a con- 
sortium of 15 states, the SBAC ELA assessment (Smarter Balanced Assessment 
Consortium, 2016) is one of the Common Core State Standards aligned measures for 
Grades 3 to 8. It has two components, a computer-adaptive test and performance 
tasks that combine traditional assessment questions with interactive activities to 
assess students’ abilities to apply critical thinking and solve problems. In Grades 3 to 
5, the SBAC ELA contains between 43 and 47 items related to reading, writing, 
speaking, and research. The assessment is untimed, but the estimated total testing 
time is about 3.5 hours. SBAC proficiency was chosen as a criterion because educa- 
tors, policy makers, and parents are concerned about students achieving proficiency 
on statewide tests. The assessment was expected to be predictive of SBAC ELA pro- 
ficiency because both tests are in the area of English language literacy and both cover 
reading, although reading is only one component of the SBAC ELA assessment. 

As the criterion variable, we used district-provided achievement level information 
on SBAC ELA student classifications, which were available for a greater number of 
students than scale scores. The SBAC ELA measure provides four achievement levels 
based on the corresponding scale scores: 1 = not meeting the state ELA achievement
standards; 2 = nearly meeting the state ELA achievement standards; 3 = meeting the 
state ELA achievement standards; and 4 = exceeding the state ELA achievement 
standards. We classified students at Levels 1 to 2 as not proficient and those at Levels 
3 to 4 as proficient. For the analysis, SBAC performance was coded as 1 = not proficient
and 0 = proficient.


Multiple-choice Online Causal Comprehension Assessment (MOCCA). The Multiple- 
choice Online Causal Comprehension Assessment (MOCCA) is a multiple-choice, 
online assessment designed to identify comprehension processes for students in 
Grades 3 to 5. It has nine computer-administered, 40-item forms, with three at each
grade level. Participants were randomly assigned to take one of the three forms at 
their grade level. Each MOCCA item consists of a seven-sentence story in which the 
sixth sentence is removed. For each item, three response types are presented: a cau- 
sally coherent inference, a paraphrase, and a lateral connection. The causally coher- 
ent inference represents the correct answer (i.e., the original sixth sentence) and 
indicates full comprehension of the item-story. The other two incorrect responses are 
used to identify patterns in the types of processes students tend toward when not 
comprehending fully (i.e., paraphrase or lateral connection). The test itself imposes 
no time limit, but testing typically occurred during one class period (i.e., 30-60 min- 
utes), with a mean student testing time of 35 minutes across grades. 

MOCCA provides a total of six scores of interest to this research, five of which 
were used as predictors. The first two are the number correct (NC) out of 40 possible 
items, and its inverse, the number of incorrect responses (NI) out of 40 possible items 
(i.e., NI = 40 − NC). There are three ways in which a student can fail to answer an
item correctly: choosing the paraphrase response option, choosing the lateral connect 
response option, or failing to reach an item in the time allowed. This leads to three 
subscores, each with a range of zero to 40: the number of paraphrase responses (NP), 
the number of lateral connect responses (NL), and the number of items not reached
(NR). The first five scores are related, such that NI = 40 − NC = NP + NL + NR. In
the analyses below, NI and NC are interchangeable, but inversely related measures of 
a student’s overall performance on the test. We emphasize NI rather than NC, 
because it represents the sum of the three incorrect responses NP + NL + NR, and 
this summative relationship leads directly to methods for comparing the performance 
of total score NI vs. the three incorrect response scores (Davison & Davenport, 2002; 
Davison et al., 2015; Haberman, 2008a). The sixth score of interest is the comprehen- 
sion efficiency (CE), the student’s total testing time divided by the number of correct 
responses, in number of minutes and seconds. 
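
The relationships among the six scores can be sketched in a few lines. This is only an illustration under stated assumptions: the response codes ("correct", "paraphrase", "lateral", and None for not reached), the function name `mocca_scores`, and the choice to leave CE undefined when no item is answered correctly are all hypothetical and are not taken from the MOCCA scoring software.

```python
def mocca_scores(responses, total_minutes, n_items=40):
    """Compute NC, NI, NP, NL, NR, and CE from one student's item responses.

    `responses` holds one entry per item: "correct", "paraphrase", "lateral",
    or None for an item that was not reached (hypothetical coding).
    """
    if len(responses) != n_items:
        raise ValueError("expected one response per item")
    nc = responses.count("correct")
    np_ = responses.count("paraphrase")
    nl = responses.count("lateral")
    nr = responses.count(None)
    ni = n_items - nc
    # The three incorrect types must account for every incorrect item.
    if ni != np_ + nl + nr:
        raise ValueError("unrecognized response code")
    # CE is minutes per correct response; left undefined when NC is zero.
    ce = total_minutes / nc if nc > 0 else None
    return {"NC": nc, "NI": ni, "NP": np_, "NL": nl, "NR": nr, "CE": ce}


# Invented example: 24 correct, 7 paraphrase, 5 lateral connect, 4 not reached.
example = ["correct"] * 24 + ["paraphrase"] * 7 + ["lateral"] * 5 + [None] * 4
scores = mocca_scores(example, total_minutes=36.0)
print(scores)
```

For this invented student, NI = 16 = 7 + 5 + 4, and CE is 36 minutes divided by 24 correct responses.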


Procedures 


Schools were recruited through local connections and the DIBELS Data System 
(DDS) at the University of Oregon. Students with parental consent took MOCCA in 
groups either in their school computer lab or in their classrooms on computers or 
tablets between February and June of 2016. At the end of the school year, participat- 
ing districts shared students’ state assessment scores for that year. To evaluate our 
hypotheses regarding the incremental utility of process propensity and comprehension 
efficiency subscores, we conducted logistic regression and ROC curve analyses for 
each grade. For the logistic regression analyses, we tested a series of three increas- 
ingly complex models aligned with our hypotheses. Our first two models were chosen 
because comparing the fit of the two models directly addresses the question of 
whether subscores add to validity over a total score. Model 1 includes only one predictor,
the student’s total number incorrect (NI). A key feature of Model 1 is that it
assigns the same expected probability to every student with the same total score. This
provides a baseline estimate of the extent to which the total score is predictive of 
SBAC outcomes, without the inclusion of any subscale scores. Given the relationship 
between number correct (NC) and total number incorrect (NI = 40 − NC), it would be
redundant to repeat this analysis using NC as the predictor.

Model 2 contains the three incorrect score predictors: NP, NL, and NR. A key fea- 
ture of Model 2 is that it assigns different expected probabilities to people with the 
same total score NI but different patterns of incorrect responses. The model of 
Equation (2) is a hierarchically embedded submodel of Equation (1) in which all of 
the linear coefficients in Equation (1) are constrained to be equal. That is, if π is the
probability of being not proficient for a predictor vector (NP, NL, and NR), the logit
for Model 2 can be expressed as:

$$\ln\left(\frac{\pi}{1-\pi}\right) = b_1\,\mathrm{NP} + b_2\,\mathrm{NL} + b_3\,\mathrm{NR} + a \tag{1}$$

In the hierarchically embedded submodel with all three weights equal to b, Equation
(1) becomes:

$$\ln\left(\frac{\pi}{1-\pi}\right) = b\,(\mathrm{NP} + \mathrm{NL} + \mathrm{NR}) + a \tag{2}$$

$$\ln\left(\frac{\pi}{1-\pi}\right) = b\,\mathrm{NI} + a \tag{3}$$


because NI = NP + NL + NR. Equation (3) represents Model 1 with only a single 
predictor, NI. 

Model 3 allows us to test whether taking efficiency into account improves predic- 
tion. It represents a further extension of Model 2, adding a fourth predictor to the 
three incorrect response types in Model 2: comprehension efficiency CE. Model 2 
can thus be considered a hierarchically embedded submodel of Model 3 in which the 
weight for CE is constrained to 0. 
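
The nesting of the three models can be made concrete with a small sketch. All coefficients below are invented for illustration and are not the estimates reported in Table 4; the function name and the two example students are likewise hypothetical.

```python
import math


def prob_not_proficient(np_, nl, nr, b_np, b_nl, b_nr, a, ce=0.0, b_ce=0.0):
    """Expected probability of being not proficient under a logit model.

    With b_np == b_nl == b_nr and b_ce == 0 this is Model 1 (total score only);
    with free subscore weights and b_ce == 0 it is Model 2; a nonzero b_ce
    gives Model 3.
    """
    logit = b_np * np_ + b_nl * nl + b_nr * nr + b_ce * ce + a
    return 1.0 / (1.0 + math.exp(-logit))


# Two hypothetical students with the same total NI = 16 but different patterns.
student_a = (12, 4, 0)   # mostly paraphrase responses
student_b = (2, 2, 12)   # mostly items not reached

# Model 1: a single invented weight applied to every incorrect type.
p1_a = prob_not_proficient(*student_a, 0.15, 0.15, 0.15, -3.0)
p1_b = prob_not_proficient(*student_b, 0.15, 0.15, 0.15, -3.0)

# Model 2: invented, unequal weights for NP, NL, and NR.
p2_a = prob_not_proficient(*student_a, 0.25, 0.20, 0.05, -3.0)
p2_b = prob_not_proficient(*student_b, 0.25, 0.20, 0.05, -3.0)

print(round(p1_a, 3), round(p1_b, 3))
print(round(p2_a, 3), round(p2_b, 3))
```

Under the equal-weight model the two students receive the same expected probability, whereas the unequal weights separate them, which is the distinction the likelihood ratio tests evaluate.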


Results 
Subscores 


For subscores to be useful, they must be reliable. Across forms and grades, the relia- 
bility (alpha) of the NC (or NI) scores ranged from .93 to .94, the NP reliabilities 
from .86 to .89, the NL score reliabilities from .72 to .82, and the NR score reliabil- 
ities from .96 to .97. The NR reliabilities are almost certainly inflated by a lack of 
independence of the NR response variable across items at the end of the test, but to 
date, the only available reliability estimates are internal consistency estimates. 

For subscores to be useful, they must also be distinct. For validity purposes, this 
means that their intercorrelations should not be too high. Lyren (2009) and McPeek, 
Altman, Wallmark, and Wingersky (1976) suggest an upper limit of .90 for the disat- 
tenuated correlations. Sinharay (2010) proposed an upper limit of .80 for the average 
disattenuated correlation among the subscores, adding that “it is possible to find
unique tests for which these figures do not provide accurate guidance” (p. 169). For
our subscores, the average (over three forms) disattenuated correlation estimates for 
NP and NL were .88, .82, and .81 in third, fourth, and fifth grades, respectively. For 
NP and NR, the corresponding figures were −.35, −.17, and −.51, and for NL and
NR, they were −.51, −.22, and −.22.
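
For readers unfamiliar with disattenuation, the sketch below applies the standard correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. The observed correlation of .65 is an invented value (observed correlations are not reported above), and the reliabilities are simply picked from within the ranges reported for NP and NL.

```python
import math


def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Invented observed correlation; reliabilities chosen from the reported ranges.
r_true = disattenuated_correlation(0.65, reliability_x=0.87, reliability_y=0.77)
print(round(r_true, 2))
```

Because the reliabilities are below 1, the corrected correlation is always larger in absolute value than the observed one.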


Added Value of Subscores 


Haberman (2008a) proposed an analysis that uses the reliability of a subscore, the 
reliability of the total score, and the correlation of the subscore with the total score 
to determine if subscores add value over and above a total score. Since our subscores 
NP, NL, and NR add to a total incorrect score NI, the analysis can be applied to these 
subscores. To describe the analysis, consider one of the subscores NP. For each per- 
son, one can conceive a true score on a paraphrase propensity dimension manifested 
in the subscore NP. If one wants to estimate a person’s true paraphrase score, there 
are three ways to estimate that true score: (1) estimate it from the observed NP score, 
(2) estimate it from the observed total number of errors NI, or (3) estimate it using 
both NP and NI. In Haberman’s analysis, one asks the question: Which of these three 
methods of estimating the true paraphrase score will yield the most precise true score 
estimate where precision is measured by the root mean square error (RMSE) of esti- 
mation? The method with the smallest RMSE gives the most precise estimate. A sub- 
score can be said to add value if the RMSE estimating with the subscore alone or the 
subscore in combination with the total score is smaller than the RMSE estimating 
with the total score alone. 

Table 1 shows the RMSE for each method of estimating the true score for each of 
our three incorrect types in all three grades. For instance, for the paraphrase incorrect 
type in third grade, the RMSE was 4.57 if the true score is estimated from the total 
observed score NI, 2.13 if the true score is estimated from the observed subscore NP, 
and 2.12 if estimated from both NI and NP. Because the RMSE estimating with NP 
or both NP and NI is smaller than the RMSE for estimating with the total score NI 
alone, the subscore can be said to add value over the total score. In every grade, the 
RMSE is smaller for estimation with the observed subscore or with the observed sub- 
score and total score NI than the RMSE for estimation with the total score NI alone. 
Estimating with the observed subscore yields a smaller RMSE than estimating from 
the total score and adding the total score to the subscore improves estimation very lit- 
tle. As added value is defined by Haberman, all of our subscores add value over and 
above the total incorrect score NI at every grade. 
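
The decision rule described above can be checked mechanically against the values in Table 1. The sketch below only restates that rule; it does not reproduce Haberman’s estimation of the RMSEs themselves, and the names `TABLE_1` and `adds_value` are hypothetical.

```python
# (grade, subscore): (RMSE from total score, RMSE from subscore, RMSE from both),
# copied from Table 1.
TABLE_1 = {
    (3, "NP"): (4.57, 2.13, 2.12),
    (3, "NL"): (3.30, 1.92, 1.87),
    (3, "NR"): (7.40, 1.42, 1.39),
    (4, "NP"): (4.03, 1.89, 1.88),
    (4, "NL"): (3.15, 1.76, 1.72),
    (4, "NR"): (6.76, 1.36, 1.34),
    (5, "NP"): (3.27, 1.79, 1.78),
    (5, "NL"): (2.70, 1.82, 1.68),
    (5, "NR"): (5.68, 1.11, 1.11),
}


def adds_value(rmse_total, rmse_subscore, rmse_both):
    """A subscore adds value if estimating from the subscore alone, or from the
    subscore plus the total, is more precise than estimating from the total."""
    return min(rmse_subscore, rmse_both) < rmse_total


results = {key: adds_value(*rmses) for key, rmses in TABLE_1.items()}
print(all(results.values()))
```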


Table 1. Value Added Analysis: Root Mean Square Errors (RMSE) for Estimating Subscore
True Scores From Observed Total Score, Observed Subscore, and Both.

Grade  Subscore         RMSE from total score  RMSE from subscore  RMSE from both
3      # Paraphrase     4.57                   2.13                2.12
3      # Lateral Cnct.  3.30                   1.92                1.87
3      # Not Reached    7.40                   1.42                1.39
4      # Paraphrase     4.03                   1.89                1.88
4      # Lateral Cnct.  3.15                   1.76                1.72
4      # Not Reached    6.76                   1.36                1.34
5      # Paraphrase     3.27                   1.79                1.78
5      # Lateral Cnct.  2.70                   1.82                1.68
5      # Not Reached    5.68                   1.11                1.11

Note. Lateral Cnct. = lateral connect.


Means of Proficient and Nonproficient Students


Table 2 shows the mean scores of proficient and non-proficient students by grade on
the six scores described above: NC, NI, NP, NL, NR, and CE. Proficient and
non-proficient students differ significantly on all six measures at every grade. For the two
overall scores, NC and NI, effect sizes are large, ranging between 1.3 and 1.6 in absolute
value over the grades. For NP and NL, effect sizes are somewhat smaller, but
still generally large in absolute value, ranging from −0.71 to −1.01. The effect sizes
for NR are somewhat smaller in absolute value, ranging from −0.33 to −0.74. Of
the three types of incorrect responses, the number of items not reached is less highly
associated with proficiency at each of the grades. The effect sizes for CE are also
relatively large, ranging from −0.64 to −1.02.
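
The effect sizes are reported in a column labeled g in Table 2, but the formula is not stated. The sketch below assumes they are Hedges’ g values, that is, a pooled-standard-deviation mean difference (proficient minus not proficient) with the usual small-sample correction; that assumption, and the function name `hedges_g`, are ours. The inputs are the third-grade NC means, standard deviations, and group sizes from Table 2.

```python
import math


def hedges_g(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardized mean difference with pooled SD and small-sample correction
    (assumed formula; the source reports g without defining it)."""
    df = n_1 + n_2 - 2
    pooled_sd = math.sqrt(((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df)
    d = (mean_1 - mean_2) / pooled_sd
    correction = 1 - 3 / (4 * df - 1)  # common approximation to the exact factor
    return d * correction


# Third grade, NC: proficient (n = 120) versus not proficient (n = 125).
g_nc = hedges_g(26.87, 8.57, 120, 14.96, 7.87, 125)
print(round(g_nc, 2))
```

Under these assumptions the result is close to the tabled value of 1.44 for third-grade NC, which suggests, but does not confirm, that the assumed formula matches the one used.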


Logistic Regression 


Table 3 shows the fit measures for the logistic regression models with the dichoto- 
mous SBAC proficiency variable as the criterion at each grade. At every grade, the 
likelihood ratio tests comparing Models 1 and 2 lead to rejection of Model 1, which
assigns equal expected probabilities to everyone with the same total score, in favor of
Model 2, which distinguishes among students with the same total score based on their
pattern of incorrect responses. In addition, the AIC and the BIC are lower for Model
2 than Model 1 in every grade, suggesting that Model 2 fits the data better than
Model 1, even after accounting for the two additional parameters in Model 2. As one
measure of effect size, Table 3 also contains Nagelkerke’s pseudo-R² (Nagelkerke,
1991), a measure of the extent to which the addition of parameters improves the
prediction of a logistic regression model. We chose to report Nagelkerke’s pseudo-R²
because it ranges between 0.0 and 1.0 and thus has the same range as the familiar R².
However, unlike the familiar R², it does not have a proportion of variance interpretation,
and it improves as a function of the likelihood rather than variance accounted
for. Using three predictors improves Nagelkerke’s pseudo-R² by .05 to .10, depending
on the grade. These results support the hypothesis that Model 2, with separate
subscores, better accounts for the data than the simpler Model 1, with only total
incorrect score as a predictor.

Table 2. Means, Standard Deviations, and Effect Sizes for MOCCA Scores by Grade and
Proficiency.

Score  M (Proficient)  M (Not Proficient)  SD (Proficient)  SD (Not Proficient)  g

Third grade (number proficient = 120; number not proficient = 125)
NC     26.87           14.96               8.57             7.87                 1.44**
NI     13.13           25.04               8.57             7.87                 −1.44**
NP     3.28            8.17                4.23             5.48                 −0.99**
NL     3.10            6.94                2.54             4.57                 −1.03**
NR     6.75            9.93                8.84             10.37                −0.33*
CE     1.46            2.63                1.09             1.70                 −0.81**

Fourth grade (number proficient = 119; number not proficient = 91)
NC     31.08           16.65               8.84             9.69                 1.56**
NI     8.92            23.35               8.84             9.69                 −1.56**
NP     1.97            6.37                3.11             5.54                 −1.01**
NL     2.23            5.10                2.53             4.21                 −0.85**
NR     4.72            11.88               8.38             11.45                −0.73**
CE     1.06            2.71                0.56             3.85                 −0.64**

Fifth grade (number proficient = 101; number not proficient = 69)
NC     31.89           20.49               8.15             9.56                 1.30**
NI     8.11            19.51               8.15             9.56                 −1.30**
NP     1.88            4.17                2.63             3.94                 −0.71**
NL     1.91            4.55                2.32             3.96                 −0.85**
NR     4.32            10.78               7.87             9.81                 −0.74**
CE     0.96            1.84                0.40             1.25                 −1.02**

Note. MOCCA = Multiple-choice Online Causal Comprehension Assessment; NC = number correct; NI
= number incorrect; NP = number of paraphrases; NL = number of lateral connects; NR = number not
reached; CE = comprehension efficiency (minutes/correct).
*p < .05. **p < .01.

As hypothesized, the results for Model 3 vary by grade. In third grade, the likelihood
ratio test (Model 2 vs. Model 3) failed to reject (p > .05) the more parsimonious
Model 2, with just response-based subscores. In fourth and fifth grades,
however, the likelihood ratio tests led to rejection of Model 2 that does not include
comprehension efficiency in favor of the model that does. Although the improvement
is small, adding comprehension efficiency improves Nagelkerke’s pseudo-R² by .01
in fourth grade and .05 in fifth grade. In addition, the AIC and the BIC are lower for
Model 3 than Model 2, suggesting that Model 3 better fits the data than either Model
1 or 2, even after taking into account the additional parameter in Model 3.
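
The fit indices and likelihood ratio tests can be recomputed from the −2LL values in Table 3. The sketch below uses the fifth-grade values and the fifth-grade sample size (n = 170). The parameter counts (2, 4, and 5, including the intercept) are our inference from the models’ predictors, and the chi-square tail probabilities use closed forms that hold only for 1 and 2 degrees of freedom; function names are hypothetical.

```python
import math


def aic(neg2ll, n_params):
    """Akaike information criterion from -2 log likelihood."""
    return neg2ll + 2 * n_params


def bic(neg2ll, n_params, n_obs):
    """Bayesian information criterion from -2 log likelihood."""
    return neg2ll + n_params * math.log(n_obs)


def chi_square_p(stat, df):
    """Upper-tail chi-square probability (closed forms for df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("only df = 1 or df = 2 is supported in this sketch")


# Fifth grade: -2LL from Table 3; parameter counts inferred (intercept included).
n_fifth = 170
neg2ll = {"Model 1": 175.97, "Model 2": 166.61, "Model 3": 158.84}
n_params = {"Model 1": 2, "Model 2": 4, "Model 3": 5}

aic_m3 = aic(neg2ll["Model 3"], n_params["Model 3"])
bic_m3 = bic(neg2ll["Model 3"], n_params["Model 3"], n_fifth)

lrt_1_vs_2 = neg2ll["Model 1"] - neg2ll["Model 2"]   # 2 extra parameters
lrt_2_vs_3 = neg2ll["Model 2"] - neg2ll["Model 3"]   # 1 extra parameter
p_1_vs_2 = chi_square_p(lrt_1_vs_2, df=2)
p_2_vs_3 = chi_square_p(lrt_2_vs_3, df=1)

print(round(aic_m3, 2), round(bic_m3, 2))
print(round(lrt_1_vs_2, 2), round(lrt_2_vs_3, 2))
```

With these inputs the recomputed AIC and BIC for Model 3 agree with the tabled fifth-grade values to two decimals, and both likelihood ratio statistics fall below the .01 significance level.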

Table 3. Logistic Regression Fit Statistics, Nagelkerke R², and Likelihood Ratio Test (LRT) for
Comparing Each Model to the Model Above It.


Model        −2LL      AIC       BIC       R²       LRT

Third grade
Model 1     242.34    246.34    253.34    0.437
Model 2     212.99    220.99    234.99    0.538    29.35**
Model 3     212.99    222.99    240.50    0.538     0.00

Fourth grade
Model 1     197.45    201.45    208.15    0.467
Model 2     184.15    192.15    205.53    0.521    13.30**
Model 3     178.82    188.82    205.53    0.537     5.25*

Fifth grade
Model 1     175.97    179.97    186.25    0.365
Model 2     166.61    174.61    187.15    0.418     9.36**
Model 3     158.84    168.84    184.52    0.460     7.77**


Note. LL = log likelihood; AIC = Akaike information criterion; BIC = Bayesian information criterion.
*p < .05. **p < .01.


Table 4 shows the regression weights and their standard errors by grade for
Models 2 and 3. For each of the three incorrect response variables, the unit is the
same, one item. Thus, for each incorrect response variable, the unstandardized
regression weight indicates the amount by which the expected logit increases with a
one-item increase in the predictor. In third grade, the three incorrect response
variables are significant at p < .01 in both models, but the addition of comprehension
efficiency in Model 3 is not. In fourth grade, all of the predictors in both models are
significant at p < .05. In fifth grade, NP is not significant in either model. In Model
3, CE is significant at p < .05, but NR is not.

Out of concern for multicollinearity, variance inflation factors (VIF) were computed
for every predictor in each of our models. The largest VIF was 2.32 for NP in
the model with four predictors (NP, NL, NR, CE) in third grade. Kutner, Nachtsheim,
Neter, and Li (2005, p. 409) suggest that VIF > 10 be taken as evidence for serious
multicollinearity, so the observed values fall well below that threshold.
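
As an illustration of the diagnostic, a VIF for predictor j equals 1 / (1 − R_j²), where R_j² comes from regressing that predictor on the others; equivalently, it is the j-th diagonal element of the inverse of the predictors' correlation matrix. The sketch below uses that second form in plain Python. The scores are made up for illustration and are not study data, and the function names are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlations among equal-length predictor columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """Diagonal of the inverse correlation matrix, one VIF per predictor."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Made-up subscores for five hypothetical students (correlation = .80)
np_scores = [1, 2, 3, 4, 5]
nl_scores = [2, 1, 4, 3, 5]
vif_values = vifs([np_scores, nl_scores])
print([round(v, 2) for v in vif_values])  # [2.78, 2.78]
```

With two predictors correlated at .80, each VIF is 1 / (1 − .64), about 2.78, which is still far below the threshold of 10 cited above.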

At the level of the individual student, the choice of model can make a major
difference if the assessment is used to identify which students are at risk and in need of
remediation. For instance, using a predicted probability of .5 from the logistic
regression analysis as the cut-score separating those who are and are not at risk, 31
(13%) of third graders would be classified differently depending on whether Model 1
or Model 2 probabilities were used. If these probabilities were used in deciding who
was eligible for remediation, there would be 31 children for whom the eligibility
recommendation would depend on the model chosen. In fourth and fifth grade, the
eligibility recommendations of 22 (10%) and 15 (9%) students, respectively, would
depend on the model chosen. If the program is effective, eligibility represents a
high-stakes decision for the child.
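
The counting behind these figures can be sketched as follows. The helper `count_reclassified` is a hypothetical name; the three probability pairs in the small demonstration are taken from Table 5, and the percentages assume the denominators are the grade-level group sizes given in Table 2, which is consistent with the rounded values reported above.

```python
def count_reclassified(probs_a, probs_b, cut=0.5):
    """Count cases whose at-risk decision differs between two models."""
    return sum((pa > cut) != (pb > cut) for pa, pb in zip(probs_a, probs_b))

# Three third graders from Table 5: Model 1 vs. Model 2 probabilities
demo_changed = count_reclassified([0.6473, 0.6473, 0.6473],
                                  [0.3409, 0.5376, 0.9660])

# Reported counts of reclassified students, with grade sizes from Table 2
grade_n = {"third": 120 + 125, "fourth": 119 + 91, "fifth": 101 + 69}
changed = {"third": 31, "fourth": 22, "fifth": 15}
percent = {g: round(100 * changed[g] / grade_n[g]) for g in grade_n}
print(demo_changed, percent)  # 1 {'third': 13, 'fourth': 10, 'fifth': 9}
```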


Table 4. Unstandardized Logistic Regression Weights With Standard Errors for Models 2 and
3 by Grade.


                    Model 2                Model 3
Variable         B         SE B         B         SE B

Third grade
NP            0.132**     0.042      0.132**     0.048
NL            0.421**     0.077      0.422**     0.078
NR            0.147**     0.023      0.147**     0.025
CE                                  −0.003       0.156

Fourth grade
NP            0.263**     0.063      0.212**     0.067
NL            0.155*      0.070      0.136*      0.069
NR            0.119**     0.019      0.083**     0.025
CE                                   0.710*      0.347

Fifth grade
NP            0.112       0.074      0.040       0.080
NL            0.310**     0.083      0.264**     0.086
NR            0.112**     0.021      0.045       0.032
CE                                   1.495*      0.595

Note. Dependent variable was coded 1 = not proficient, 0 = proficient. NP = number of paraphrases; NL
= number of lateral connects; NR = number not reached; CE = comprehension efficiency (minutes/
correct).
*p < .05. **p < .01.


ROC Curve Analyses


The three panels of Figure 1 show the results of the ROC curve analyses by grade. In
the ROC analysis, model-predicted probability of being not proficient was used to
predict actual not proficient status. In all grades, the AUC exceeds .80 for every
model, indicating that, for MOCCA, the total score alone provides a relatively
accurate diagnosis of SBAC proficiency. In third grade, the AUC for Model 1 was .84
(CI = .79-.89), the AUC for Model 2 was .88 (CI = .84-.92), and the AUC for Model
3 was .88 (CI = .84-.92). In fourth grade, the AUC for Model 1 was .86 (CI = .80-.91),
the AUC for Model 2 was .88 (CI = .83-.92), and the AUC for Model 3 was .88
(CI = .84-.93). In fifth grade, the AUC for Model 1 was .81 (CI = .75-.88), the AUC
for Model 2 was .83 (CI = .77-.90), and the AUC for Model 3 was .85 (CI = .79-.91).

The addition of subscore data consistently improved the accuracy of the diagnosis, 
both across grades and across the distribution of student skill, as measured by both 
the AUC statistic and a visual inspection of the curve for each model. That is, in each 
grade, there are large sections of the plot where the curve for Model 2 is clearly above 
the curve for Model 1. In third grade, the curves for Models 2 and 3 are nearly identi- 
cal, emphasizing that the addition of comprehension efficiency in third grade does 
not improve the model. In contrast, the plots for both fourth and fifth grade have areas 
(i.e., ranges of student scores) where the curve for Model 3 is clearly above the curve 
for Model 2. 
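
To make the AUC statistic concrete, it can be computed as the proportion of (not proficient, proficient) student pairs in which the not proficient student receives the higher predicted probability, with ties counted as one half. The sketch below applies that definition to the 13 third graders listed in Table 5, using the probabilities as printed there. This is a small subset selected for having identical total scores, so the resulting values are illustrative only and are not the AUCs reported above; the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Share of positive/negative pairs ranked correctly (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 13 third graders from Table 5 (1 = not proficient on SBAC)
actual = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
model1 = [0.6473] * 13  # identical total scores give identical probabilities
model2 = [0.3409, 0.4726, 0.4681, 0.5376, 0.6650, 0.677, 0.789,
          0.8333, 0.8809, 0.9225, 0.9225, 0.9551, 0.9660]

print(auc(actual, model1))            # 0.5
print(round(auc(actual, model2), 3))  # 0.944
```

Within a group of students who share a total score, Model 1 cannot rank anyone above anyone else, so its AUC is at chance, whereas the subscore pattern in Model 2 orders most pairs correctly.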
Figure 1. Receiver operating characteristic (ROC) curve results for three models in each of
three grades.


Model Differences for Individual Students 


The impact of Model 1 versus Model 2 differences for individual students is best
illustrated by examining the predicted probabilities for a group of students with the same
overall score but different incorrect response patterns. Table 5 illustrates this
difference for 13 third grade students, all of whom had the same number of items correct
(17, in the NC column) and the same number of incorrect answers (23, in the NI column).
However, these students differ in their patterns of incorrect responses, and they have
been rank ordered from lowest to highest based on their number of NL responses, the
predictor with the largest weight in third grade. The Model 1 Prob. column shows the
Model 1 predicted probability of being at risk (.6473), which is the same for every
student because all of them have the same score, 23, on the only predictor in Model 1
(i.e., NI). Because this predicted value is greater than .5, the logistic regression model
classified all 13 in the at-risk group.


Table 5. Subscores, Model Probabilities, and Proficiency for 13 Third Grade Students With
the Same Overall Score.


            Subscore                               Model 1       Model 2             Model 3
NC   NI   NP   NL   NR    CE     Actual class.     Prob.      Prob.   At risk     Prob.   At risk

17   23    0    0   23   1.01         0            .6473      .3409      0        .3415      0
17   23    0    2   21   0.73         0            .6473      .4726      0        .4735      0
17   23   20    3    0   2.86         0            .6473      .4681      0        .4683      0
17   23    1    3   19   1.86         1            .6473      .5376      1        .5376      1
17   23    2    5   16   1.53         1            .6473      .6650      1        .6653      1
17   23   17    6    0   3.94         0            .6473      .677       1        .6765      1
17   23   15    8    0   1.94         1            .6473      .789       1        .7896      1
17   23   14    9    0   1.79         1            .6473      .8333      1        .8337      1
17   23    6   10    7   1.83         1            .6473      .8809      1        .8811      1
17   23   11   12    0   1.58         1            .6473      .9225      1        .9228      1
17   23   11   12    0   1.60         1            .6473      .9225      1        .9228      1
17   23    9   14    0   5.10         1            .6473      .9551      1        .9547      1
17   23    8   15    0   2.36         1            .6473      .9660      1        .9660      1


Note. CE = comprehension efficiency; NC = number correct; NI = number incorrect; NP = number of
paraphrase responses chosen; NL = number of lateral connection options chosen; NR = number of items
not reached; Actual class. = observed proficiency classification (0 = not at risk, 1 = at risk) based on
Smarter Balanced Assessment Consortium (SBAC) performance; Prob. = model estimated probability of
student risk status; At risk = model predicted risk status based on subscale scores.

The Model 2 Prob. column shows the predicted probabilities for Model 2. Unlike the
Model 1 probabilities, they are not equal, and range from .3409 to .9660. Because the
predicted probability of risk was less than .5 for three of the 13, three students were
classified as not at risk by Model 2 but as at risk by Model 1. All three students did in
fact score at the proficient level on the SBAC, making the not-at-risk designation the
correct classification. If one were placing these students in an intervention on the basis
of these two models, the results would be very different for these three students.
Similarly, the Model 3 Prob. column shows the predicted probabilities for Model 3, which
range from .3415 to .9660. Because CE did not improve the model in third grade, the same
students are predicted to be at risk by Model 2 and Model 3. As illustrated here, Model 1
assigns equal at-risk probabilities to every student with the same total score. Models 2
and 3 distinguish between students with the same total score based on their pattern of
incorrect responses. In this example, ranking students by their NL scores ranks them
from low to high on their Model 2 predicted at-risk probabilities, because the NL
responses have the highest regression weight. In all grades, given equal number
incorrect scores, students with a predominance of NR incorrect responses were at
comparatively small risk. Especially in third and fifth grades, those with a predominance
of NL incorrect responses were at highest risk. This can be seen in Table 5 by
comparing predicted probabilities for those with a strong predominance of NR
responses (first two rows) with the probabilities for those having a predominance of
NL responses (last two rows).
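
The way equal total scores map to different risk estimates can be sketched with the third grade Model 2 weights from Table 4. The intercept is not reported, so the sketch back-solves it from the first row of Table 5; this is an assumption made purely for illustration. Because the published weights are rounded, the computed probabilities agree with Table 5 only approximately. The function names are hypothetical.

```python
import math

# Third grade Model 2 weights as printed in Table 4
WEIGHTS = {"NP": 0.132, "NL": 0.421, "NR": 0.147}

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

# ASSUMPTION: the intercept is not reported; back-solve it from the first
# row of Table 5 (NP = 0, NL = 0, NR = 23, Model 2 probability = .3409).
INTERCEPT = logit(0.3409) - WEIGHTS["NR"] * 23

def model2_probability(n_paraphrase, n_lateral, n_not_reached):
    """Predicted probability of being not proficient under Model 2."""
    z = (INTERCEPT
         + WEIGHTS["NP"] * n_paraphrase
         + WEIGHTS["NL"] * n_lateral
         + WEIGHTS["NR"] * n_not_reached)
    return inv_logit(z)

# Two students with the same total score (17 correct, 23 incorrect)
mostly_not_reached = model2_probability(0, 2, 21)  # Table 5 reports .4726
mostly_lateral = model2_probability(8, 15, 0)      # Table 5 reports .9660
print(mostly_not_reached < 0.5 < mostly_lateral)   # True
```

Each lateral connection response raises the logit by about .42, nearly three times the increase for an item not reached, which is why the two students fall on opposite sides of the .5 cut-score.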


Discussion 


The simple means presented in Table 2 make it clear that proficient and
non-proficient students (as measured by SBAC) differed in multiple ways in their
performance on MOCCA. They differed in how many items they answered correctly, in
their selection of both types of incorrect responses, and in the number of items they
failed to reach. Moreover, they differed not only in how accurately they answered
questions and in the incorrect responses they preferred but also in the efficiency with
which they answered correctly (i.e., the number of minutes per correct item). This
latter finding is consistent with the hypothesis that as reading improves, it becomes
more automatic, and with automaticity comes greater efficiency (LaBerge & Samuels,
1974; Logan, 1997; Perfetti, 1985; Samuels, Ediger, Willcutt, & Palumbo, 2008;
Samuels & Flor, 1997). It is also consistent with earlier findings on oral
reading comprehension rate (e.g., Neddenriep, Hale, Skinner, Hawkins, & Winn,
2007; Skinner, Neddenriep, Bradley-Klug, & Ziemann, 2002; Skinner et al., 2009).
However, the contribution of comprehension efficiency, over and above that of the
incorrect response scores, is small, as evidenced by the similarity of the curves for
Models 2 and 3 in Figure 1.

Including subscores in prediction models, rather than just an overall score, had
three effects. First, it led to models that better accounted for the data. Compared to
the total score only model, including the three incorrect response scores resulted in a
better fit: lower AIC and BIC values, a higher Nagelkerke pseudo-R², and a significant
likelihood ratio test in all three grades. These findings support the conclusion
that, given carefully constructed alternatives, the pattern of student incorrect
responses can add information over and above that provided by the total score alone.
These subscores indicate that students with the same total score may not be equally
at-risk. As evidenced by the ROC graphs, incorrect responses were especially
informative in third grade, where students tend to make more mistakes overall.

Similarly, adding comprehension efficiency improved prediction and model fit in
fourth and fifth grade: lower AIC and BIC values, as well as significant likelihood
ratio tests. That this finding did not hold for third grade seems consistent with
automaticity theory, which posits that the reading comprehension process is initially slow
and controlled, but with practice becomes more rapid and automatic (LaBerge &
Samuels, 1974; Logan, 1997; Perfetti, 1985; Samuels et al., 2008; Samuels & Flor,
1997). However, automaticity theory is not clear about when the transition to more
automatized processing occurs. Practically important differences in automaticity may
not occur until somewhat later in reading comprehension development, in which
case comprehension efficiency might not be a good predictor until somewhat later
in the developmental process.

Second, adding the incorrect response scores changed
the ROC curves. In third grade, the ROC curve for Model 2 is higher than that for
Model 1 at almost every point on the curve. In fourth and fifth grade, the ROC curve
for Model 2 is, with few exceptions, as high or higher at every point on the curve.
The AUC is lowest for Model 1 at every grade. Judging by the ROC curves, models with
more than a single predictor seemed to do as well as or better than those with only a
total score at almost all points along the sensitivity continuum.

The third effect of including subscores occurred at the level of the individual stu- 
dent. When an intervention is highly effective, the decision to provide an interven- 
tion (or not) can have a large impact on student outcomes. In the grades evaluated 
here, the decision of whether to provide an intervention changed for approximately 
10% to 15% of students when based on a single total score rather than the pattern of 
incorrect responses. Adding comprehension efficiency had a smaller impact, but for 
a given student, the consequences could be substantial. 


Investigating Subscores 


Just as reliability limits the validity of a single test, the reliability of the total score 
and its subscores limits the incremental validity of subscores, and Haberman (2008b) 
derives an estimate of that upper limit. However, we were not interested in the upper 
limit of subscore incremental validity, but rather the actual increment to validity for a 
specific criterion variable. Davison et al. (2015) describe a linear model for this pur- 
pose given a continuous criterion variable. Here we illustrated how their procedure 
can be extended to a categorical criterion variable based on a logistic model. Our 
analysis also illustrates how, given the linear relationship between number correct 
and number incorrect, the procedures of both Davison et al. (2015) and Haberman 
(2008a) can be extended to an analysis of subscores based on incorrect responses. 
Most importantly, we have described a method of constructing incorrect responses to
multiple-choice items that, at least in this case study, yielded subscores with added
value and validity.

Although the current study is a case study of a single assessment, it does point to 
the possibility of constructing subscores that add information over and above a total 
score by using carefully constructed distractors each corresponding to a cognitive 
process leading to an identifiable processing preference. It also suggests that effi- 
ciency measures may add information over and above a single total score. Efficiency 
measures may be particularly valuable in the case of computer-administered assess- 
ments where testing times can be recorded with no additional effort on the part of 
the test administrator. For individually administered tests or paper/pencil tests, it 
may be impractical to compute efficiency indices. In addition, the use of incorrect
answer patterns and efficiency scores may also apply beyond the assessment of read- 
ing. Like the conclusions from any single study, however, these suggestions must be 
viewed with caution pending replication with other reading assessments and in other 
domains. 